OTSS: Output-Targeted Soft Segmentation for Contextual Decision-Weight Learning

arXiv.org Machine Learning

Many machine learning systems make constrained decisions by optimizing factorized objectives, but the context-specific objective is often treated as fixed. We study contextual decision-weight learning: from logged decisions and proxy outputs, learn an optimizer-facing weight vector w(x) over interpretable decision factors z(x,d), rather than a direct policy or generic predictive score. We propose OTSS, an output-targeted soft-segmentation model that deploys the personalized decision-ready weight vector. At the function-class level, the theory highlights a hard-versus-soft distinction. Hard partitions incur an approximation-estimation tradeoff under overlap, while a realizable fixed-K soft class removes the hard-partition approximation floor and attains a parametric rate. We evaluate OTSS in controlled benchmarks with finite evaluation libraries, where the true weight vector and downstream regret can be computed exactly. In the representative overlap setting, OTSS attains the lowest mean regret among the comparators, including EM mixture regression, the strongest soft-mixture baseline in our comparison; it matches EM on coefficient recovery while running about two orders of magnitude faster. In a matched K=5 benchmark, OTSS remains competitive under hard-routed truth and improves as heterogeneity becomes softer and sample size grows. On a fixed Complete Journey retail anchor with real household covariates and action geometry, OTSS again achieves the lowest mean-regret point estimate.
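The soft-segmentation idea in this abstract can be illustrated with a minimal sketch. It assumes, without confirmation from the source, that segment memberships come from a softmax gate, that w(x) is the membership-weighted average of K segment-level weight vectors, and that a decision is scored linearly as w(x)·z(x,d) over a finite library. All names (`soft_segment_weights`, `gate_W`, `segment_weights`, `regret`) are hypothetical and are not taken from the OTSS paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def soft_segment_weights(x, gate_W, gate_b, segment_weights):
    """Return (w(x), pi(x)) for a soft mixture of K segment weight vectors.

    Assumed form: pi(x) = softmax(gate_W @ x + gate_b) and
    w(x) = sum_k pi_k(x) * segment_weights[k].
    """
    pi = softmax(gate_W @ x + gate_b)
    return pi @ segment_weights, pi


def regret(w_true, w_hat, Z):
    """Regret of acting on w_hat when the true weights are w_true.

    Z has one row z(x, d) per candidate decision d in a finite library;
    scoring is assumed to be linear in the decision factors.
    """
    best_value = np.max(Z @ w_true)
    chosen = int(np.argmax(Z @ w_hat))
    return float(best_value - Z[chosen] @ w_true)
```

With a finite decision library `Z`, `regret` can be computed exactly whenever the true weight vector is known, which is the kind of evaluation the abstract describes for its controlled benchmarks.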


Supplementary for Neural Methods for Point-wise Dependency Estimation

Neural Information Processing Systems

In this section, we give detailed derivations for the point-wise dependency estimation methods. Four approaches are discussed: Variational Bounds of Mutual Information, Density Matching, Probabilistic Classifier, and Density-Ratio Fitting. For convenience, we define $\Omega = \mathcal{X} \times \mathcal{Y}$. Let $P_{X,Y}$ and $P_X P_Y$ (also written $P_X \otimes P_Y$) be probability measures on a $\sigma$-algebra over $\Omega$, with probability densities given by the Radon-Nikodym derivatives (i.e., $p(x,y) = dP_{X,Y}/d\mu$ and $p(x)p(y) = dP_X P_Y/d\mu$, with $\mu$ being the Lebesgue measure). These estimators have the logarithm of the point-wise dependency (PMI) as an intermediate product, which we show in the following. Let $\mathcal{M}$ be any class of functions $m: \Omega \to \mathbb{R}$. Proposition 1 ($I_{\rm NWJ}$ and its neural estimation, restating the Nguyen-Wainwright-Jordan bound [5, 18]).
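For orientation, the Nguyen-Wainwright-Jordan bound named in Proposition 1 is usually stated as below. This is the standard textbook form, written in the notation of this excerpt; the exact statement in the supplementary material is not reproduced here and may differ in presentation.

```latex
% Standard NWJ lower bound on mutual information (assumed form, see lead-in).
I(X;Y) \;\ge\; I_{\mathrm{NWJ}}
  := \sup_{m \in \mathcal{M}}
     \mathbb{E}_{P_{X,Y}}\!\left[ m(x,y) \right]
     - e^{-1}\, \mathbb{E}_{P_X P_Y}\!\left[ e^{m(x,y)} \right],
\qquad
m^{*}(x,y) = 1 + \log \frac{p(x,y)}{p(x)\,p(y)} .
```

When $\mathcal{M}$ is rich enough to contain $m^{*}$, the optimal critic equals the PMI shifted by one, which is why the logarithm of the point-wise dependency appears as an intermediate product of this estimator.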


Learning Functional Transduction: S.I. Contents

Neural Information Processing Systems

We provide below the proofs of the results presented in the main text. Most of the arguments are adapted from the development proposed in (Zhang, 2013), which goes beyond the real or complex-valued RKBS developed in (Zhang et al., 2009; Song et al., 2013) to develop the notion of vector-valued RKBS. In addition, we note that assumptions regarding the properties of the RKBS of interest, such as uniform Fréchet differentiability and uniform convexity, have been further relaxed in other works (Xu and Ye, 2019; Lin et al., 2022) but are here sufficient for our discussion, since they guarantee the uniqueness of a semi-inner product $\langle \cdot, \cdot \rangle_B$ compatible with the norm $\|\cdot\|_B$ (Giles, 1967). S.1.1 Theoretical results. Theorem 1. For the sake of compactness, Theorem 1 gathers the definition of a vector-valued reproducing kernel Banach space together with the existence and uniqueness of the kernel $K$. Proof. For any $v \in V$ and $u \in U$, the mapping $O \mapsto \langle O(v), u \rangle_U$ is a bounded linear form in $L(B)$.


Appendix

Neural Information Processing Systems

A. About Equation (1). As we discussed in Section 3, label smoothing and focal loss are equivalent to the standard CE loss with an additional maximum-entropy regularizer (see Equations (1) and (2) in the main text). The proof of Equation (2) can be found in the corresponding paper [4]. SVHN is an image dataset consisting of $32 \times 32$ color images of the digits 0 to 9. CIFAR-10 and CIFAR-100 consist of $32 \times 32$ color natural images arranged in 10 and 100 classes, respectively. For 20Newsgroups, we use the GloVe word embedding [7] for text representation before the 1D-CNN model and set the embedding dimension to 100.
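The label-smoothing half of this claim can be checked numerically with a small sketch. It uses the textbook decomposition of the smoothed-target cross entropy into a scaled CE term plus a uniform-target term that favors higher-entropy predictions. This may differ in form from Equation (1) of the main text, which is not reproduced in this excerpt, and the function names are hypothetical.

```python
import numpy as np


def cross_entropy(target, probs):
    """Cross entropy -sum_i target_i * log(probs_i) for one example."""
    return float(-np.sum(target * np.log(probs)))


def label_smoothing_terms(probs, label, eps=0.1):
    """Split the label-smoothing loss into its CE part and uniform part.

    Returns (total, ce_term, uniform_term), where
    total = CE((1 - eps) * one_hot + eps * uniform, probs),
    ce_term = (1 - eps) * CE(one_hot, probs),
    uniform_term = eps * CE(uniform, probs).
    """
    k = probs.shape[0]
    one_hot = np.eye(k)[label]
    uniform = np.full(k, 1.0 / k)
    smoothed = (1.0 - eps) * one_hot + eps * uniform
    total = cross_entropy(smoothed, probs)
    ce_term = (1.0 - eps) * cross_entropy(one_hot, probs)
    uniform_term = eps * cross_entropy(uniform, probs)
    return total, ce_term, uniform_term
```

The uniform-target term equals eps times (KL(uniform || probs) + log K), so it is smallest when the prediction is uniform, i.e. when its entropy is maximal.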




Supplement to "Uniform Concentration Bounds toward a Unified Framework for Robust Clustering"

Neural Information Processing Systems

For the theoretical exposition, we first establish the following lemmas. Lemma A.1 proves that the derivative of the function $\phi$ is bounded in the $\ell_2$-norm when the domain is restricted to the support of $P$. Lemma A.1. Lemma A.3 proves that the function $f_\Theta$, as a function of $\Theta$, is Lipschitz with respect to the $\|\cdot\|$ norm. (Joint first authors contributed equally. Corresponding author. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).) Thus, from equation (1), $\langle \nabla\phi(P_C(\theta)) - \nabla\phi(\theta),\, x - P_C(\theta) \rangle \geq 0$. (2) We now observe that $d_\phi(x,\theta) - d_\phi(x,P_C(\theta)) - d_\phi(P_C(\theta),\theta) = \langle \nabla\phi(P_C(\theta)) - \nabla\phi(\theta),\, x - P_C(\theta) \rangle \geq 0$. Hence the result.


Polyhedron Attention Module: Learning Adaptive-order Interactions (Appendices)

Neural Information Processing Systems

Contents: A. Deriving Eq. 2; B. The hyperplane set generated by the oblique tree is a superset of that created by the ReLU-activated plain DNN; C. Proof of Theorem 1; D. Proof of Theorem 2; E. Proof of Theorem 3; F. Proof of Theorem 4; G. Implementation Detail. We consider an $L$-layer ($L \geq 2$) ReLU-activated plain DNN module $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ with input $x \in \mathbb{R}^p$. Eq. 2 in the main text can be obtained by rewriting [...]. An oblique tree is a binary tree where each node splits the space by a hyperplane rather than by thresholding a single feature. The tree starts with the root of the full input space $S$, and by recursively splitting $S$, the tree grows deeper. For a $D$-depth ($D \geq 3$) binary tree, there are $2^{D-1} - 1$ internal nodes and $2^{D-1}$ leaf nodes. As shown in Figure 1, each internal and leaf node maintains a sub-space representing a polyhedron in $S$, and each layer of the tree corresponds to a partition of the input space into polyhedrons.
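The oblique-tree description above can be sketched as follows. This is an illustrative routing routine only, assuming a complete binary tree stored in heap order with one hyperplane per internal node. The names `node_counts` and `route_to_leaf` and the array layout are hypothetical and not taken from the paper.

```python
import numpy as np


def node_counts(depth):
    """Internal and leaf node counts of a complete binary tree of given depth.

    Depth counts levels, so depth 1 is a single leaf.
    """
    return 2 ** (depth - 1) - 1, 2 ** (depth - 1)


def route_to_leaf(x, W, b):
    """Route input x to a leaf of a complete oblique tree.

    W has shape (n_internal, p) and b has shape (n_internal,): internal
    node i splits on the hyperplane W[i] @ x + b[i] = 0. Nodes are stored
    in heap order, so node i has children 2*i + 1 (left) and 2*i + 2 (right).
    Returns the leaf index in 0 .. n_leaves - 1, counted left to right.
    """
    n_internal = W.shape[0]
    node = 0
    while node < n_internal:
        go_right = W[node] @ x + b[node] > 0
        node = 2 * node + 2 if go_right else 2 * node + 1
    return node - n_internal
```

Each leaf index corresponds to one polyhedron: the intersection of the half-spaces chosen along the root-to-leaf path.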


Controlled object: Main model. Output: $f_{unk}(h_m)$; $C_B(h_m) = $ hˆL; $f_{unk}(h_s, d_s)$; $C_F(h_s)$. Input: $h_m$; $h_m$; $h_s, d_s$; $h_s$

Neural Information Processing Systems

[...] out that, though the temporal case of forward DNI was not originally considered in [5], there remain [...]. In fact, it was recently suggested that the cerebellum learns to mimic the forward computations which then take place in the neocortex. [...] backward DNI where the main model is a motor-associated RNN. In general, the likeness in formulation between DNI and the cerebellar internal model hypothesis can be summarised in Table S1. Table S1 labels: Controller (Neocortex; Main model); Cerebellum (Synthesiser); columns Forward Model, Backward DNI, Inverse Model, Forward DNI. [...] properties of the inverse model of the cerebellum can be set against those of forward DNI (red). Notation is largely consistent with section 2 of the main text: $h_m$, $h_s$ denote the hidden activity of a motor area and a sensory area, respectively; $C_B$, $C_F$ denote the computation of backward DNI and forward DNI, respectively; $L$ denotes the loss function. In addition, the inverse model of the cerebellum traditionally also has access to a desired state $d_s$ (in particular, one can consider this a [...]). There are no explicit equations for the [...].


Details and Ablation Studies for Language Modelling

Neural Information Processing Systems

A.1 Experimental Settings All language models in Table 1 have the same Transformer configuration: a 16-layer model with a hidden size of 128 with 8 heads, and a feed-forward dimension of 2048. We use a dropout [75, 76, 77] rate of 0.1. The batch size is 96 and we train for about 120 epochs with Adam optimiser [78] with an initial learning rate of 0.00025 and 2000 learning rate warm-up steps. All models are trained with a back-propagation span of 256 tokens. During training, these segments are treated independently, except for the + full context cases in Table 1 where the states (both recurrent states and fast weight states) from a segment are used as initialisation for the subsequent segment. The models in + full context cases are also evaluated in the same way by carrying over the context throughout the evaluation text with a batch size of one. For all other cases, the evaluation is done by going through the text with a sliding window of size 256 with a batch size of one. Transformer states are computed for all positions in each window, but only the last position is used to compute perplexity (except in the first segment where all positions are used for evaluation) [2].
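The sliding-window evaluation described above can be sketched as a scoring plan. This is a minimal illustration, assuming token positions are 0-indexed and that "position" refers to the token whose loss is scored. The function names and the per-token loss list are hypothetical, not the authors' code.

```python
import math


def sliding_window_plan(n_tokens, window=256):
    """Yield (start, end, scored) triples for sliding-window evaluation.

    The first window covers tokens [0, window) and scores every position.
    Each later window ends at token t (exclusive end t + 1), spans at most
    `window` tokens, and scores only its last position t.
    """
    if n_tokens <= 0:
        return
    first_end = min(window, n_tokens)
    yield 0, first_end, list(range(first_end))
    for t in range(first_end, n_tokens):
        yield t - window + 1, t + 1, [t]


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

Under this plan every token is scored exactly once, and every token after the first window is scored with the full window of context, at the cost of one forward pass per token.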